US008479225B2

(12) United States Patent
     Covell et al.

(10) Patent No.: US 8,479,225 B2
(45) Date of Patent: Jul. 2, 2013

(54) SOCIAL AND INTERACTIVE APPLICATIONS FOR MASS MEDIA

(75) Inventors: Michele Covell, Palo Alto, CA (US); Shumeet Baluja,
     Santa Clara, CA (US); Michael Fink, Brookline, MA (US)

(73) Assignee: Google Inc., Mountain View, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is
     extended or adjusted under 35 U.S.C. 154(b) by 521 days.

(21) Appl. No.:

(22) Filed: Nov. 27, 2006

(65) Prior Publication Data
     US 2007/0130580 A1    Jun. 7, 2007

Related U.S. Application Data
(60) Provisional application No. 60/740,760, filed on Nov. 29, 2005,
     provisional application No. 60/823,881, filed on Aug. 29, 2006.

(51) Int. Cl.
     H04H 60/32 (2008.01)
     G06F 3/00 (2006.01)
     G06F 13/00 (2006.01)
     H04N 5/445
     H04H 60/58 (2008.01)
     H04N 21/439 (2011.01)
     H04N 21/462 (2011.01)

(52) U.S. Cl.
     CPC H04H 60/58 (2013.01); H04N 21/4394 (2013.01);
     H04N 21/4622 (2013.01)
     USPC 725/18; 725/40; 725/51

(58) Field of Classification Search
     USPC 725/37
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

4,811,399 A 3/1989 Landell et al.
5,706,364 A 1/1998 Kopec et al.
5,870,744 A 2/1999 Sprague
6,023,693 A 2/2000 Masuoka et al.
6,044,365 A 3/2000 Cannon et al.
6,236,758 B1 5/2001 Sodagar et al.
6,494,720 B1 12/2002 Meyrowitsch
6,529,526 B1 3/2003
6,563,909 B2 5/2003 Schmitz
6,585,521 B1 7/2003 Obrador
6,704,920 B2 3/2004 Brill et al.
6,751,601 B2 6/2004 Zegers
6,754,667 B2 6/2004 Kim et al.
6,763,339 B2 7/2004 Fu et al.
(Continued)

FOREIGN PATENT DOCUMENTS

EP 1524857 4/2005
JP 2002-209204 7/2002
JP 2004043438 2/2004

OTHER PUBLICATIONS

Yang, Chen. “MACS: Music Audio Characteristic Sequence Indexing for
Similarity Retrieval”. Oct. 21-24, 2001, New Paltz, New York.
(Continued)

Primary Examiner — Bennett Ingvoldstad
(74) Attorney, Agent, or Firm — Fish & Richardson P.C.

(57) ABSTRACT

Systems, methods, apparatuses, user interfaces and computer program
products provide social and interactive applications for mass media
based on real-time ambient-audio and/or video identification.

39 Claims, 7 Drawing Sheets


[Representative drawing: Ambient-Audio Identification System
(Client-Side) 200, showing Mass-Media System 202, Ambient-Audio
Detector 204, Network Access Device 206, Client-Side Interface 102,
and element 210.]




(56) References Cited (continued)

U.S. PATENT DOCUMENTS

6,766,523 B2 7/2004 Herley
6,773,266 B1 8/2004 Dornbush et al.
6,782,186 B1 8/2004 Covell et al.
6,879,967 B1 4/2005 Stork
6,892,191 B1 5/2005 Schaffer
6,895,514 B1 5/2005 Kermani
6,944,632 B2 9/2005 Stern
6,970,131 B2 11/2005 Percy et al.
7,103,801 B2 9/2006 Marilly et al.
7,107,207 B2 9/2006 Goodman
7,266,492 B2 9/2007 Goodman
7,281,219 B2 10/2007 Hamilton et al.
7,375,304 B2 5/2008 Kainec et al.
7,386,479 B2 6/2008 Mizuno
7,472,096 B2 12/2008 Burges et al.
7,617,164 B2 11/2009 Burges et al.
7,831,531 B1 11/2010 Baluja et al.
2002/0023020 A1* 2/2002 Kenyon et al. .... 705/26
2002/0133499 A1* 9/2002 Ward et al. .... 707/102
2002/0194585 A1 12/2002 Connelly
2003/0033223 A1 2/2003 Mizuno
2003/0093790 A1* 5/2003 Logan et al. .... 725/38
2003/0101144 A1 5/2003 Moreno
2003/0225833 A1* 12/2003 Pilat et al. .... 709/204
2004/0025174 A1 2/2004 Cerrato et al.
2004/0031058 A1 2/2004 Reisman
2004/0128682 A1 7/2004 Liga et al.
2004/0199387 A1 10/2004 Wang et al.
2005/0066352 A1 3/2005 Herley
2005/0086682 A1 4/2005 Burges et al.
2005/0096920 A1 5/2005 Matz et al.
2005/0147256 A1* 7/2005 Peters et al. .... 381/56
2005/0193016 A1 9/2005 Seet et al.
2005/0283792 A1 12/2005 Swix et al.
2006/0080356 A1 4/2006 Burges et al.
2006/0174348 A1* 8/2006 Rhoads et al. .... 726/26
2007/0124756 A1 5/2007 Covell et al.
2007/0143778 A1 6/2007 Covell et al.
2008/0090551 A1 4/2008 Gidron et al.
2008/0263041 A1 10/2008 Cheung

OTHER PUBLICATIONS

“Compression” definition. Oxford English Dictionary. Accessed Apr.
27, 2009. http://dictionary.oed.com/cgi/entry/50045890?single=1&query_type=word&queryword=compression.

“Database” definition. Oxford English Dictionary. Accessed Apr. 27,
2009. http://dictionary.oed.com/cgi/entry/50057772?single=1&query_type=word&queryword=database.

“Encrypt” definition. Oxford English Dictionary. Accessed Apr. 27,
2009. http://dictionary.oed.com/cgi/entry/00292459?single=1&query_type=word&queryword=encrypt.

“Community” definition. Oxford English Dictionary. Accessed Apr.
27, 2009. http://dictionary.oed.com/cgi/entry/50045241?single=1&query_type=word&queryword=community.

PCT International Search Report in corresponding PCT application
#PCT/US06/45549 dated Oct. 9, 2007, 2 pages.

U.S. Appl. No. 11/468,265, filed Aug. 29, 2006, Covell et al.

Google, Inc., International Search Report and Written Opinion of the
International Application No. PCT/US06/45551 dated Jul. 21, 2008,
16 pages.

Burges et al., “Duplicate Detection and Audio Thumbnails with
Audio Fingerprinting” [online]. 2004, [retrieved on Nov. 21, 2006].
Retrieved on the Internet: <URL: www.research.microsoft.com/~cburges/tech_reports/tr-2004-19.pdf>, 5 pages.

Cano et al., “A Review of Algorithms for Audio Fingerprinting”
[online]. 2002, [retrieved on Nov. 21, 2006]. Retrieved from the
Internet: <URL: www.iua.upf.es/mtg/publications/MMSP-2002-pcano.pdf>, 5 pages.

Haitsma and Kalker, “A Highly Robust Audio Fingerprinting System”
[online]. 2002, [retrieved on Nov. 16, 2006]. Retrieved from the
Internet: <URL: www.ismir2002.ismir.net/proceedings/02-FP04-2.pdf>, 9 pages.

Jacobs et al., “Fast Multiresolution Image Querying” [online]. 1995,
[retrieved on Nov. 21, 2006]. Retrieved from the Internet: <URL:
www.grail.cs.washington.edu/projects/query/mrquery.pdf>, 10 pages.

Ke et al., “Computer Vision for Music Identification” [online]. 2005,
[retrieved on Nov. 21, 2006]. Retrieved from the Internet: <URL:
www.cs.cmu.edu/~yke/musicretrieval/cvpr2005-mr.pdf>, 8 pages.

Shazam, “Shazam Entertainment Brings Music Recognition to
Windows Mobile 5.0 Powered Smartphones” [online]. 2006,
[retrieved on Nov. 16, 2006]. Retrieved from the Internet: <URL:
www.shazam.com/music/portal/sp/s/media-type/html/user/anon/page/default/template/pages/p/company_release30.html>, 1 page.

Stanford, “CS276 Information Retrieval and Web Mining” [online].
2005, [retrieved on Nov. 16, 2006]. Retrieved from the Internet:
<URL: www.stanford.edu/class/cs276/handouts/lecture19.pdf>, 8 pages.

Stanford, “Data Mining: Associations” [online]. 2002, [retrieved on
Nov. 16, 2006]. Retrieved from the Internet: <URL:
www.stanford.edu/class/cs206/cs206-2.pdf>, 11 pages.

Stollnitz et al., “Wavelets for Computer Graphics: A Primer, Part 1”
[online]. 1995, [retrieved on Nov. 21, 2006]. Retrieved from the
Internet: <URL: www.grail.cs.washington.edu/pub/stoll/wavelet1.pdf>, 8 pages.

Stollnitz et al., “Wavelets for Computer Graphics: A Primer, Part 2”
[online]. 1995, [retrieved on Nov. 21, 2006]. Retrieved from the
Internet: <URL: www.grail.cs.washington.edu/pub/stoll/wavelet2.pdf>, 9 pages.

Wang, “The Shazam Music Recognition Service,” Communications
of the ACM, Aug. 2006, 49(8): 5 pages.

European Search Report, EP Application No. 08 15 3719 mailed Sep.
26, 2008, 8 pages.

Viola and Jones, “Robust Real-Time Object Detection,” Int. J.
Computer Vision, 2002, 25 pages.

International Preliminary Report on Patentability, Application No.
PCT/US06/45549 mailed Jun. 12, 2008, 7 pages.

International Preliminary Report on Patentability, Application No.
PCT/US06/45551 mailed Apr. 2, 2009, 11 pages.

Burges et al., “Using Audio Fingerprinting for Duplicate Detection
and Thumbnail Generation,” Mar. 2005, 4 pages.

Cohen et al., “Finding Interesting Associations without Support
Pruning,” 2001, Retrieved from the Internet: <URL:
www.dbis.informatik.huberlin.de/dbisold/lehre/WS0405/KDD/paper/CDFG_00.pdf>, 12 pages.

“Shazam Experience Music” [online]. [retrieved on May 30, 2007].
Retrieved from the Internet: <URL: www.shazam.com/music/portal/sp/s/media-type/html/user/anon/page/default/template/Myhome/music.html>, 2 pages.

A Multi-layer Adaptive Function Neural Network (MADFUNN) for
Analytical Function Recognition, Miao Kang; Palmer-Brown, D.;
Neural Networks, 2006. IJCNN '06. International Joint Conference
on Digital Object Identifier 10.1109/IJCNN.2006.246895 Publication
Year 2006, pp. 1784-1789.

An adaptive function neural network (ADFUNN) for phrase
recognition, Miao Kang; Palmer-Brown, D., Neural Networks, 2005.
IJCNN '05. Proceedings. 2005 IEEE International Joint Conference on
vol. 1, Digital Object Identifier: 10.1109/IJCNN.2005.1555898
Publication Year 2005, pp. 593-597, vol. 1.

An all-phoneme ergodic HMM for unsupervised speaker adaptation,
Miyazawa, Y., Acoustics, Speech, and Signal Processing, 1993.
ICASSP-93., 1993 IEEE International Conference on vol. 2, Digital
Object Identifier 10.1109/ICASSP.1993.319372, Publication Year
1993, pp. 574-577, vol. 2.

An information theoretic approach to adaptive system training using
unlabeled data, Kyu-Hwa Jeong; Jian-Wu Xu; Principe, J.C., Neural
Networks, 2005. IJCNN '05. Proceedings. 2005 IEEE International
Joint Conference on vol. 1, Jul. 31-Aug. 4, 2005, pp. 191-195, vol. 1
Digital Object Identifier 10.1109/IJCNN.2005.1555828.

Connectionist training of non-linear hidden Markov models for
speech recognition, Zhao, Z., Neural Networks, 1991. 1991 IEEE
International Joint Conference on Nov. 18-21, 1991, pp. 1647-1652,
vol. 2, Digital Object Identifier 10.1109/IJCNN.1991.170645.

On adaptive acquisition of language, Gorin, A.L.; Levinson, S.E.;
Miller, L.G.; Gertner, A.N.; Ljolje, A.; Goldman, E.R., Acoustics,
Speech, and Signal Processing, 1990. ICASSP-90., 1990 International
Conference on Digital Object Identifier: 10.1109/ICASSP.1990.115784
Publication Year 1990, pp. 601-604, vol. 1.

Training neural networks with additive noise in the desired signal,
Chuan Wang; Principe, J.C.; Neural Networks, IEEE Transactions on
vol. 10, Issue 6, Nov. 1999, pp. 1511-1517.

Training neural networks with additive noise in the desired signal,
Chuan Wang; Principe, J.C., Neural Networks Proceedings, 1998.
IEEE World Congress on Computational Intelligence. The 1998
IEEE International Joint Conference on vol. 2, May 4-9, 1998, pp.
1084-1089, vol. 2, Digital Object Identifier 10.1109/IJCNN.1998.685923.

Chinese Patent Office Action for Application No. 200680051559.0
dated Jan. 22, 2010, 14 pages.

Gauch, J. M. et al., “Identification of New Commercials Using
Repeated Video Sequence Detection,” Sep. 11, 2005, Image
Processing, 2005, ICIP 2005, IEEE International Conference on Genova,
Italy Sep. 11-14, 2005, Piscataway, NJ, USA, IEEE, pp. 1252-1255.

Sadlier et al., “Automatic TV Advertisement Detection from MPEG
Bitstream,” 2001, Pattern Recognition, vol. 35, Issue 12, pp.
2719-2726.

Supplemental EP Search Report dated Feb. 16, 2010, 97 pages.

Ding et al., “Robust Technologies towards Automatic Speech
Recognition in Car Noise Environments,” Signal Processing, 2006 8th
International Conference on vol. 1, Publication Year: 2006, 4 pages.

Weixin et al., “Learning to Rank Using Semantic Features in
Document Retrieval,” Intelligent Systems, 2009. GCIS '09. WRI Global
Congress on vol. 3, Publication Year: 2009, pp. 500-504.

Lin et al., “Input Data Representation for Self-Organizing Map in
Software Classification,” Knowledge Acquisition and Modeling,
2009. KAM '09. Second International Symposium on vol. 2,
Publication Year: 2009, pp. 350-353.

Notice of Reasons for Rejection for Japanese Patent Application No.
2008-543391, mailed Dec. 13, 2011, 3 pages.

Robust Technologies towards Automatic Speech Recognition in Car
Noise Environments, Ding, P.; He, L.; Yan, X.; Zhao, R.; Hao, J.;
Signal Processing, 2006 8th International Conference on vol. 1,
Digital Object Identifier: 10.1109/ICOSP.2006.345538 Publication Year:
2006.

Learning to Rank Using Semantic Features in Document Retrieval,
Tian Weixin; Zhu Fuxi, Intelligent Systems, 2009. GCIS '09. WRI
Global Congress on vol. 3 Digital Object Identifier:
10.1109/GCIS.2009.148, Publication Year: 2009, pp. 500-504.

Input Data Representation for Self-Organizing Map in Software
Classification, Yuqing Lin; Huilin Ye, Knowledge Acquisition and
Modeling, 2009. KAM '09. Second International Symposium on vol.
2, Digital Object Identifier: 10.1109/KAM.2009.151 Publication
Year: 2009, pp. 350-353.

Office Action in Japanese Application No. 2008-543391, mailed Jun.
5, 2012, 2 pages.

Response to Final Office Action submitted Jul. 24, 2012, in U.S.
Appl. No. 11/563,653, 18 pages.

* cited by examiner


U.S. Patent    Jul. 2, 2013    Sheet 1 of 7    US 8,479,225 B2

[Figure 1: block diagram of Mass Personalization System 100, showing
Client-Side Interfaces 102, Audio Database Server 104, Social
Application Server 106, Network 108 and Audio Database 110.]

Figure 1
U.S. Patent    Jul. 2, 2013    Sheet 2 of 7    US 8,479,225 B2

[Figure 2: Ambient-Audio Identification System (Client-Side) 200,
showing Mass-Media System 202, Ambient-Audio Detector 204, Network
Access Device 206, Client-Side Interface 102, and element 210.]

Figure 2
U.S. Patent    Jul. 2, 2013    Sheet 3 of 7    US 8,479,225 B2


Client


Monitor and Encode Auditory
Environment and/or Video
302


Submit Query Descriptor To
Social Application Server
304


Client Receives Personalized
Information & Ratings From
Social App. Server
314


Present UI With
Personalized Information &
Ratings
316


Servers


Audio DB Server
Receives Query From Client
306


Audio DB Server
Determines Content
Matches From Query
308


(output of 308: Content Matches)


Social App. Server
Aggregates Personalized
Information & Ratings
310


(output of 310: Personalized Information & Ratings)


Social App. Server Sends
Personalized Information &
Ratings To Client
312


Store Ratings In Database
For Use By Other Processes
318


Figure 3

U.S. Patent    Jul. 2, 2013    Sheet 4 of 7    US 8,479,225 B2


Audio Fingerprinting Process
400


Decompose Each Audio Snippet Into Overlapping Frames
402


Convert Each Frame Into Query Descriptors Trained To
Overcome Audio Noise & Distortion
404


Match Descriptors To Database of Reference Descriptors
406


Determine Reference Descriptor Candidates
From Database Return Hits
408


Determine High Score Candidates That Are Temporally
Consistent With Query Descriptor
410


Send List Of High Scoring Candidates
To Social Application Server
412


Figure 4


U.S. Patent    Jul. 2, 2013    Sheet 5 of 7    US 8,479,225 B2

[Figure 5: screenshot of a web browser window (menu bar: File, Edit,
View, Bookmarks, Tools, Help) presenting the following display areas:

Personalized Layer Related To Video Content (Seinfeld)

Real-Time Popularity Ratings Related To Video Content (Seinfeld)

Ad Hoc Peer Communities: “5 members of your College Friends Group
are currently watching CNN (2005.07.01) 23:50”

Chat Room Related To Video Content (Seinfeld):
Joe: What do think of Seinfeld?
Sara: I think he's boring?
Kevin: I think he's funny?

Sponsored Links Related To Video Content (Seinfeld), including
“Link 3”

What else is on TV tonight?

My Video Library: CNN (23:50), Seinfeld, CNN

Video Content (Scene From Seinfeld)

Reference numerals legible on the sheet: 504, 512, 514, 518, 522.]

Figure 5


U.S. Patent    Jul. 2, 2013    Sheet 6 of 7    US 8,479,225 B2

[Figure 6: block diagram of Client System 600, with blocks labeled
Processor(s), Network Interface(s), Display Device(s), Input
Device(s), Microphone Interface/Built-in Microphone, Operating
System, Network Communications Module, Client S/W and Other
Applications.]

Figure 6
U.S. Patent    Jul. 2, 2013    Sheet 7 of 7    US 8,479,225 B2


Repetition Detection Process
700


Generate Database Of Audio Statistics From Content
702


Generate Query From Database & Run Query Against Database
To Determine Non-Identity Matches
704


Identify Content Corresponding To Matched Query
As Repeating Content
706


Find Endpoints For Repeating Content
708


Validate Repeating Content Using Non-Auditory Information
(optional)
710


Apply Metrics To Repeating Content To Identify Advertisements
(optional)
712


Figure 7


SOCIAL AND INTERACTIVE APPLICATIONS
FOR MASS MEDIA


RELATED APPLICATIONS 


This application claims the benefit of priority from U.S.
Provisional Patent Application No. 60/740,760, for “Environment-Based
Referrals,” filed Nov. 29, 2005, which application is incorporated
by reference herein in its entirety.

This application claims the benefit of priority from U.S.
Provisional Patent Application No. 60/823,881, for “Audio
Identification Based on Signatures,” filed Aug. 29, 2006,
which application is incorporated by reference herein in its
entirety.

This application is related to U.S. patent application Ser.
No. 11/563,653, for “Determining Popularity Ratings Using
Social and Interactive Applications For Mass Media,” filed
Nov. 27, 2006, and U.S. patent application Ser. No. 11/563,665,
for “Detecting Repeating Content In Media,” filed Nov.
27, 2006. Each of these patent applications is incorporated by
reference herein in its entirety.


TECHNICAL FIELD


The disclosed implementations are related to social and 
interactive applications for mass media. 


BACKGROUND 


Mass media channels (e.g., television and radio broadcasts)
typically provide limited content to a large audience.
By contrast, the World Wide Web provides vast amounts of
information that may only interest a few individuals.
Conventional interactive television attempts to bridge these two
communication mediums by providing a means for viewers to
interact with their televisions and to receive content and/or
services related to television broadcasts.

Conventional interactive television is typically only available
to viewers through cable or satellite networks for a
subscription fee. To receive interactive television service the
viewer has to rent or buy a set-top box and have it installed by
a technician. The viewer's television is connected to the
set-top box, which enables the viewer to interact with the
television using a remote control or other input device, and to
receive information, entertainment and services (e.g.,
advertisements, online shopping, forms and surveys, games and
activities, etc.).

While conventional interactive television can improve the
viewer's television experience, there remains a need for
social and interactive applications for mass media that do not
rely on significant additional hardware or physical connections
between the television or radio and a set-top box or
computer.

One social and interactive television application that is
lacking with conventional and interactive television systems
is the ability to provide complementary information to the
mass media channel in an effortless manner. With conventional
systems, a user would have to log on to a computer and
query for such information, which would diminish the passive
experience offered by mass media. Moreover, conventional
television systems cannot provide complementary information
in real-time while the user is watching a broadcast.

Another social and interactive television application that is
lacking with conventional interactive television systems is the
ability to dynamically link a viewer with an ad hoc social peer
community (e.g., a discussion group, chat room, etc.) in
real-time. Imagine that you are watching the latest episode of
“Friends” on television and discover that the character
“Monica” is pregnant. You want to chat, comment or read
other viewers' responses to the scene in real-time. One option
would be to log on to your computer, type in the name of
“Friends” or other related terms into a search engine, and
perform a search to find a discussion group on “Friends.”
Such required action by the viewer, however, would diminish
the passive experience offered by mass media and would not
enable the viewer to dynamically interact (e.g., comment,
chat, etc.) with other viewers who are watching the program
at the same time.


SUMMARY 


The deficiencies described above are addressed by the dis- 
closed systems, methods, apparatuses, user interfaces and 
computer program products for providing social and interac- 
tive applications based on real-time ambient-audio and/or 
video identification. 

In some implementations, a method includes: receiving a
descriptor identifying ambient audio associated with a media
broadcast; comparing the descriptor to reference descriptors
associated with the media broadcast; and aggregating
personalized information related to the media broadcast based on the
result of the comparison.

In some implementations, a method includes: receiving a
first descriptor identifying ambient audio associated with a
first media broadcast; receiving a second descriptor identifying
ambient audio associated with a second media broadcast;
comparing the first and second descriptors to determine if the
first and second media broadcasts are the same; and
aggregating personalized information based on the result of the
comparison.

In some implementations, a method includes: detecting
ambient audio associated with a media broadcast; generating
descriptors identifying the media broadcast; transmitting the
descriptors to a network resource; and receiving aggregated
personalized information from the network resource based on
the descriptors.

In some implementations, a system includes a database of
reference descriptors. A database server is operatively
coupled to the database and to a client system. The database
server is configurable to receive a descriptor from the client
system for identifying ambient audio associated with a media
broadcast, to compare the received descriptor with one or
more reference descriptors, and to aggregate personalized
information related to the media broadcast based on the result
of the comparison.

In some implementations, a system includes an audio
detector configurable for sampling ambient audio. A client
interface is operatively coupled to the audio detector and
configurable to generate descriptors identifying a media
broadcast. The client interface is configurable for transmitting
the descriptors to a network resource, and for receiving
aggregated personalized information from the network
resource based on the descriptors.

Other implementations are directed to systems, methods, 
apparatuses, user interfaces, and computer program products. 


DESCRIPTION OF DRAWINGS


FIG. 1 is a block diagram of one embodiment of a mass 
personalization system. 

FIG. 2 illustrates one embodiment of an ambient-audio
identification system, including the client-side interface
shown in FIG. 1.




FIG. 3 is a flow diagram of one embodiment of a process
for providing mass-personalization applications.

FIG. 4 is a flow diagram of one embodiment of an audio
fingerprinting process.

FIG. 5 illustrates one embodiment of a user interface for
interacting with mass personalization applications.

FIG. 6 is a block diagram of one embodiment of hardware
architecture for a client system for implementing the
client-side interface shown in FIG. 1.

FIG. 7 is a flow diagram of one embodiment of a repetition
detection process.


DETAILED DESCRIPTION
Mass Personalization Applications 


Mass personalization applications provide personalized
and interactive information related to mass media broadcasts
(e.g., television, radio, movies, Internet broadcasts, etc.).
Such applications include but are not limited to: personalized
information layers, ad hoc social peer communities, real-time
popularity ratings and video (or audio) bookmarks, etc.
Although some of the mass media examples disclosed herein
are in the context of television broadcasts, the disclosed
implementations are equally applicable to radio and/or music
broadcasts.

Personalized information layers provide complementary
information to the mass media channel. Examples of
personalized information layers include but are not limited to:
fashion, politics, business, health, traveling, etc. For example,
while watching a news segment on a celebrity, a fashion layer
is presented to the viewer on a television screen or a computer
display device, which provides information and/or images
related to the clothes and accessories the celebrity is wearing
in the news segment. Additionally, personalized layers may
include advertisements promoting products or services
related to the news segment, such as a link to a clothing store
that is selling clothes that the celebrity is wearing.

Ad hoc social peer communities provide a venue for
commentary between users who are watching the same show on
television or listening to the same radio station. For example,
a user who is watching the latest CNN headlines can be
provided with a commenting medium (e.g., a chat room,
message board, wiki page, video link, etc.) that allows the
user to chat, comment on or read other viewers' responses to
the ongoing mass media broadcast.

Real-time popularity ratings provide content providers and
users with ratings information (similar to Nielsen ratings).
For example, a user can instantaneously be provided with
real-time popularity ratings of television channels or radio
stations being watched or listened to by the user's social
network and/or by people with similar demographics.
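
By way of illustration only, a rating of this kind could be computed as the share of a user's peers currently tuned to each channel. The sketch below is a minimal one under stated assumptions: the function name `popularity_ratings`, the `current_channel_by_user` mapping and the `peer_ids` list are hypothetical and are not taken from the disclosure, and peers who are not currently tuned to anything are simply ignored.

```python
from collections import Counter


def popularity_ratings(current_channel_by_user, peer_ids):
    """Return {channel: fraction of active peers tuned to that channel}.

    current_channel_by_user: hypothetical mapping user id -> channel name
        for every user with a recently identified broadcast.
    peer_ids: hypothetical list of user ids in the viewer's social network
        (or demographic group).
    """
    # Keep only peers for whom a broadcast has been identified.
    tuned = [current_channel_by_user[u] for u in peer_ids
             if u in current_channel_by_user]
    if not tuned:
        return {}
    counts = Counter(tuned)
    return {channel: n / len(tuned) for channel, n in counts.items()}


if __name__ == "__main__":
    now_watching = {"ann": "CNN", "bob": "CNN", "cy": "NBC", "dee": "CNN"}
    print(popularity_ratings(now_watching, ["ann", "bob", "cy", "zed"]))
```

Normalizing by the number of active peers, rather than by all peers, is a design choice of this sketch; other normalizations are equally plausible.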

Video or audio bookmarks provide users with low effort
ways of creating personalized libraries of their favorite
broadcast content. For example, a user can simply press a button on
a computer or a remote control device and a snippet of
ambient audio and/or video of the broadcast content is recorded,
processed and saved. The snippet can be used as a bookmark
to refer to the program, or portions of the program, for later
viewing. The bookmark can be shared with friends or saved
for future personal reference.


Mass Personalization Network


FIG. 1 is a block diagram of a mass personalization system
100 for providing mass personalization applications. The
system 100 includes one or more client-side interfaces 102, an
audio database server 104 and a social application server 106,
all of which communicate over a network 108 (e.g., the
Internet, an intranet, LAN, wireless network, etc.).

A client interface 102 can be any device that allows a user
to enter and receive information, and which is capable of
presenting a user interface on a display device, including but
not limited to: a desktop or portable computer; an electronic
device; a telephone; a mobile phone; a display system; a
television; a computer monitor; a navigation system; a
portable media player/recorder; a personal digital assistant
(PDA); a game console; a handheld electronic device; and an
embedded electronic device or appliance. The client interface
102 is described more fully with respect to FIG. 2.

In some implementations, the client-side interface 102 includes
an ambient audio detector (e.g., a microphone) for monitoring
and recording the ambient audio of a mass media broadcast in
a broadcast environment (e.g., a user's living room). One or
more ambient audio segments or “snippets” are converted
into distinctive and robust statistical summaries, referred to as
“audio fingerprints” or “descriptors.” In some implementations,
the descriptors are compressed files containing one or
more audio signature components that can be compared with
a database of previously generated reference descriptors or
statistics associated with the mass media broadcast.
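
By way of illustration only, the first step of the audio fingerprinting process of FIG. 4 (decomposing each audio snippet into overlapping frames) might be sketched as follows. The function name `overlapping_frames` and the parameters `frame_len` and `hop` are hypothetical; the disclosure does not specify frame sizes, and this sketch simply drops any trailing samples that do not fill a whole frame.

```python
def overlapping_frames(samples, frame_len, hop):
    """Split a sequence of audio samples into overlapping frames.

    frame_len: number of samples per frame (assumed parameter).
    hop: number of samples between the starts of consecutive frames;
        frames overlap whenever hop < frame_len (assumed parameter).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop)]


if __name__ == "__main__":
    snippet = list(range(10))  # stand-in for recorded ambient audio
    for frame in overlapping_frames(snippet, frame_len=4, hop=2):
        print(frame)
```

Each frame would then be converted into a descriptor; that conversion is not shown here.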

A technique for generating audio fingerprints for music
identification is described in Ke, Y., Hoiem, D., Sukthankar,
R. (2005), Computer Vision for Music Identification, In Proc.
Computer Vision and Pattern Recognition (hereinafter “Ke et
al.”), which is incorporated herein by reference in its entirety.
In some implementations, the music identification approach
proposed by Ke et al. is adapted to generate descriptors for
television audio data and queries, as described with respect to
FIG. 4.

A technique for generating audio descriptors using wavelets
is described in U.S. Provisional Patent Application No.
60/823,881, for “Audio Identification Based on Signatures.”
That application describes a technique that uses a combination
of computer-vision techniques and large-scale-data-stream
processing algorithms to create compact descriptors/
fingerprints of audio snippets that can be efficiently matched.
The technique uses wavelets, which are a known mathematical
tool for hierarchically decomposing functions.
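
By way of illustration only, a hierarchical wavelet decomposition can be sketched with the Haar basis on a one-dimensional signal. The choice of the Haar basis, the averaging normalization, the one-dimensional input and the function names `haar_1d` and `top_t_signs` are all assumptions of this sketch; the referenced application operates on spectral images and may differ in every one of these respects.

```python
def haar_1d(values):
    """Haar decomposition of a power-of-two-length sequence.

    Each pass replaces pairs of values by their average and half their
    difference, then recurses on the averages. The result holds the
    overall average first, followed by detail coefficients from the
    coarsest to the finest level.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = averages + details
        length = half
    return out


def top_t_signs(coeffs, t):
    """Keep only the signs of the t largest-magnitude coefficients."""
    keep = sorted(range(len(coeffs)),
                  key=lambda i: abs(coeffs[i]), reverse=True)[:t]
    signs = [0] * len(coeffs)
    for i in keep:
        signs[i] = (coeffs[i] > 0) - (coeffs[i] < 0)
    return signs


if __name__ == "__main__":
    coeffs = haar_1d([9, 7, 3, 5])
    print(coeffs)                  # [6.0, 2.0, 1.0, -1.0]
    print(top_t_signs(coeffs, 2))  # [1, 1, 0, 0]
```

Retaining only the signs of the strongest coefficients yields a sparse, compact representation that is convenient to hash.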

In “Audio Identification Based on Signatures,” an implementation
of a retrieval process includes the following steps:
1) given the audio spectra of an audio snippet, extract spectral
images of, for example, 11.6*w ms duration, with random
spacing averaging d-ms apart. For each spectral image: 2)
compute wavelets on the spectral image; 3) extract the top-t
wavelets; 4) create a binary representation of the top-t
wavelets; 5) use min-hash to create a sub-fingerprint of the top-t
wavelets; 6) use LSH with b bins and l hash tables to find
sub-fingerprint segments that are close matches; 7) discard
sub-fingerprints with less than v matches; 8) compute a
Hamming distance from the remaining candidate sub-fingerprints
to the query sub-fingerprint; and 9) use dynamic programming
to combine the matches across time.
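
By way of illustration only, steps 5 through 8 above might be sketched in simplified form as follows. This is not the implementation of the referenced application: the function names, the use of seeded random permutations for min-hash, the grouping of signature entries into fixed-size bands as LSH keys, and the parameters `num_hashes`, `band_size` and `min_band_matches` are all assumptions made for the sketch. Step 9 (combining matches across time) is omitted.

```python
import random
from collections import defaultdict


def minhash_signature(bits, num_hashes, seed=0):
    """Min-hash of a binary vector: for each random permutation, record
    the rank at which the first set bit appears (len(bits) if none)."""
    rng = random.Random(seed)
    n = len(bits)
    signature = []
    for _ in range(num_hashes):
        perm = list(range(n))
        rng.shuffle(perm)
        first = next((rank for rank, idx in enumerate(perm) if bits[idx]), n)
        signature.append(first)
    return signature


def lsh_keys(signature, band_size):
    """Split a signature into bands; each (band index, band) is a hash key."""
    return [(b, tuple(signature[i:i + band_size]))
            for b, i in enumerate(range(0, len(signature), band_size))]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_index(signatures, band_size):
    """Map every band key to the set of reference ids that contain it."""
    index = defaultdict(set)
    for ref_id, sig in signatures.items():
        for key in lsh_keys(sig, band_size):
            index[key].add(ref_id)
    return index


def query(index, signatures, q_sig, band_size, min_band_matches):
    """Return candidate ids with enough matching bands, closest first."""
    votes = defaultdict(int)
    for key in lsh_keys(q_sig, band_size):
        for ref_id in index.get(key, ()):
            votes[ref_id] += 1
    survivors = [r for r, v in votes.items() if v >= min_band_matches]
    return sorted(survivors, key=lambda r: hamming(signatures[r], q_sig))


if __name__ == "__main__":
    refs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
    idx = build_index(refs, band_size=2)
    print(query(idx, refs, [1, 2, 3, 5], band_size=2, min_band_matches=1))
```

In this toy run, references "a" and "b" each share one band with the query and survive the vote threshold, and "a" ranks first because its signature differs from the query in fewer positions.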

In some implementations, the descriptors and an associated
user identifier (“user id”) for identifying the client-side
interface 102 are sent to the audio database server 104 via
network 108. The audio database server 104 compares the
descriptor to a plurality of reference descriptors, which were
previously determined and stored in an audio database 110
coupled to the audio database server 104. In some
implementations, the audio database server 104 continuously updates
the reference descriptors stored in the audio database 110
from recent mass media broadcasts.




The audio database server 104 determines the best matches
between the received descriptors and the reference descriptors
and sends best-match information to the social application
server 106. The matching process is described more fully
with respect to FIG. 4.
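
By way of illustration only, one way to favor candidates that are temporally consistent with the query (compare the corresponding step of FIG. 4) is to vote on the time offset between query frames and matching reference frames. Offset voting is an assumption of this sketch, not a technique stated in the text above, and the function name `best_match` and the `(query_frame, ref_id, ref_frame)` hit format are hypothetical.

```python
from collections import Counter


def best_match(hits):
    """Pick the reference whose hits agree on a single time offset.

    hits: iterable of (query_frame, ref_id, ref_frame) tuples, one per
        descriptor match returned by the database (assumed format).
    Returns ((ref_id, offset), vote_count), or None when there are no hits.
    """
    votes = Counter((ref_id, ref_frame - query_frame)
                    for query_frame, ref_id, ref_frame in hits)
    if not votes:
        return None
    return votes.most_common(1)[0]


if __name__ == "__main__":
    hits = [
        (0, "show1", 100), (1, "show1", 101), (2, "show1", 102),
        (1, "ad7", 40), (2, "ad7", 55),
    ]
    print(best_match(hits))  # (('show1', 100), 3)
```

A true match produces many hits with the same offset, whereas spurious hits scatter across offsets and collect few votes each.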

In some implementations, the social application server 106
accepts web-browser connections associated with the
client-side interface 102. Using the best-match information, the
social application server 106 aggregates personalized
information for the user and sends the personalized information to
the client-side interface 102. The personalized information
can include but is not limited to: advertisements, personalized
information layers, popularity ratings, and information
associated with a commenting medium (e.g., ad hoc social peer
communities, forums, discussion groups, video conferences,
etc.).

In some implementations, the personalized information
can be used to create a chat room for viewers without knowing
the show that the viewers are watching in real time. The chat
rooms can be created by directly comparing descriptors in the
data streams transmitted by client systems to determine
matches. That is, chat rooms can be created around viewers
having matching descriptors. In such an implementation,
there is no need to compare the descriptors received from
viewers against reference descriptors.
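
By way of example only, the direct comparison of viewer descriptors can be sketched as below. The sketch assumes each viewer submits a time-aligned, non-empty list of 32-bit descriptors; the function names, the two-bit tolerance and the 80% agreement fraction are hypothetical values chosen for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which two descriptors differ."""
    return bin(a ^ b).count("1")

def streams_match(s1, s2, max_bits=2, min_fraction=0.8):
    """True when enough time-aligned frames are within max_bits of each other."""
    close = sum(hamming(a, b) <= max_bits for a, b in zip(s1, s2))
    return close >= min_fraction * min(len(s1), len(s2))

def group_viewers(streams):
    """Greedily place each viewer in the first room whose founder matches."""
    rooms = []  # each room is a list of user ids; room[0] is its founder
    for user, stream in streams.items():
        for room in rooms:
            if streams_match(stream, streams[room[0]]):
                room.append(user)
                break
        else:
            rooms.append([user])
    return rooms

streams = {
    "alice": [0x0F0F0F0F, 0x12345678, 0xDEADBEEF],
    "bob":   [0x0F0F0F0E, 0x12345678, 0xDEADBEEF],  # one bit off in frame 0
    "carol": [0xFFFFFFFF, 0x00000000, 0xAAAAAAAA],  # unrelated audio
}
rooms = group_viewers(streams)  # [["alice", "bob"], ["carol"]]
```

No reference database is consulted at any point; the rooms emerge purely from agreement between viewer streams.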

In some implementations, the social application server 106
serves a web page to the client-side interface 102, which is
received and displayed by a web browser (e.g., Microsoft
Internet Explorer™) running at the client-side interface 102.
The social application server 106 also receives the user id
from the client-side interface 102 and/or audio database
server 104 to assist in aggregating personalized content and
serving web pages to the client-side interface 102.

It should be apparent that other implementations of the
system 100 are possible. For example, the system 100 can
include multiple audio databases 110, audio database servers
104 and/or social application servers 106. Alternatively, the
audio database server 104 and the social application server
106 can be a single server or system, or part of a network
resource and/or service. Also, the network 108 can include
multiple networks and links operatively coupled together in
various topologies and arrangements using a variety of network
devices (e.g., hubs, routers, etc.) and mediums (e.g.,
copper, optical fiber, radio frequencies, etc.). Client-server
architectures are described herein only as an example. Other
computer architectures are possible.


Ambient Audio Identification System 


FIG. 2 illustrates an ambient audio identification system
200, including a client-side interface 102 as shown in FIG. 1.
The system 200 includes a mass media system 202 (e.g., a
television set, radio, computer, electronic device, mobile
phone, game console, network appliance, etc.), an ambient
audio detector 204, a client-side interface 102 (e.g., a desktop
or laptop computer, etc.) and a network access device 206. In
some implementations, the client-side interface 102 includes
a display device 210 for presenting a user interface (UI) 208
for enabling a user to interact with a mass personalization
application, as described with respect to FIG. 5.

In operation, the mass media system 202 generates ambient
audio of a mass media broadcast (e.g., television audio),
which is detected by the ambient audio detector 204. The
ambient audio detector 204 can be any device that can detect
ambient audio, including a freestanding microphone and a
microphone that is integrated with the client-side interface
102. The detected ambient audio is encoded by the client-side
interface 102 to provide descriptors identifying the ambient
audio. The descriptors are transmitted to the audio database
server 104 by way of the network access device 206 and the
network 108.

In some implementations, client software running at the
client-side interface 102 continually monitors and records
n-second (e.g., 5 second) audio files (“snippets”) of ambient
audio. The snippets are then converted into m-frames (e.g.,
415 frames) of k-bit encoded descriptors (e.g., 32-bit),
according to a process described with respect to FIG. 4. In
some implementations, the monitoring and recording is event
based. For example, the monitoring and recording can be
automatically initiated on a specified date and at a specified
time (e.g., Monday, 8:00 P.M.) and for a specified time duration
(e.g., between 8:00-9:00 P.M.). Alternatively, the monitoring
and recording can be initiated in response to user input
(e.g., a mouse click, function key or key combination) from a
control device (e.g., a remote control, etc.). In some implementations,
the ambient audio is encoded using a streaming
variation of the 32-bit/frame discriminative features
described in Ke et al.
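
For illustration, the example values given above (5-second snippets, 415 frames, 32 bits per frame) imply the following descriptor payload and frame spacing. The constant names are hypothetical; only the three example values come from the text.

```python
SNIPPET_SECONDS = 5       # n, example value from the text
FRAMES_PER_SNIPPET = 415  # m, example value from the text
BITS_PER_FRAME = 32       # k, example value from the text

# Size of one encoded snippet: 415 frames x 32 bits = 13,280 bits.
bytes_per_snippet = FRAMES_PER_SNIPPET * BITS_PER_FRAME // 8  # 1660 bytes

# Average spacing between consecutive frames, in milliseconds (about 12 ms).
frame_spacing_ms = 1000 * SNIPPET_SECONDS / FRAMES_PER_SNIPPET

# Sustained upload rate if snippets are sent back to back.
bytes_per_second = bytes_per_snippet / SNIPPET_SECONDS  # 332.0 bytes/s
```

Under these example values each snippet is well under 2 KB, which is why continuous background sampling places little load on the network connection.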

In some implementations, the client software runs as a
“side bar” or other user interface element. That way, when the
client-side interface 102 is booted up, the ambient audio
sampling can start immediately and run in the “background”
with results (optionally) being displayed in the side bar without
invoking a full web-browser session.

In some implementations, the ambient audio sampling can
begin when the client-side interface 102 is booted or when the
viewer logs into a service or application (e.g., email, etc.).

The descriptors are sent to the audio database server 104. In
some implementations, the descriptors are compressed statistical
summaries of the ambient audio, as described in Ke et
al. By sending statistical summaries, the user’s acoustic privacy
is maintained because the statistical summaries are not
reversible, i.e., the original audio cannot be recovered from
the descriptor. Thus, any conversations by the user or other
individuals monitored and recorded in the broadcast environment
cannot be reproduced from the descriptor. In some
implementations, the descriptors can be encrypted for extra
privacy and security using one or more known encryption
techniques (e.g., asymmetric or symmetric key encryption,
elliptic encryption, etc.).

In some implementations, the descriptors are sent to the
audio database server 104 as a query submission (also
referred to as a query descriptor) in response to a trigger event
detected by the monitoring process at the client-side interface
102. For example, a trigger event could be the opening theme
of a television program (e.g., opening tune of “Seinfeld”) or
dialogue spoken by the actors. In some implementations, the
query descriptors can be sent to the audio database server 104
as part of a continuous streaming process. In some implementations,
the query descriptors can be transmitted to the audio
database server 104 in response to user input (e.g., via remote
control, mouse clicks, etc.).


Mass Personalization Process 


FIG. 3 is a flow diagram of a mass personalization process
300. The steps of process 300 do not have to be completed in
any particular order and at least some steps can be performed
at the same time in a multi-threading or parallel processing
environment.

The process 300 begins when a client-side interface (e.g.,
client-side interface 102) monitors and records snippets of
ambient audio of a mass media broadcast in a broadcast
environment (302). The recorded ambient audio snippets are
encoded into descriptors (e.g., compressed statistical summaries),
which can be sent to an audio database server (304) as
queries. The audio database server compares the queries
against a database of reference descriptors computed from
mass media broadcast statistics to determine candidate
descriptors that best match the query (308). The candidate
descriptors are sent to a social application server or other
network resource, which uses the candidate descriptors to
aggregate personalized information for the user (310). For
example, if the user is watching the television show “Seinfeld,”
then query descriptors generated from the show’s ambient
audio will be matched with reference descriptors derived
from previous “Seinfeld” broadcasts. Thus, the best matching
candidate descriptors are used to aggregate personalized
information relating to “Seinfeld” (e.g., news stories, discussion
groups, links to ad hoc social peer communities or chat
rooms, advertisements, etc.). In some implementations, the
matching procedure is efficiently performed using hashing
techniques (e.g., direct hashing or locality sensitive hashing
(LSH)) to achieve a short list of candidate descriptors, as
described with respect to FIG. 4. The candidate descriptors
are then processed in a validation procedure, such as
described in Ke et al.

In some implementations, query descriptors from different
viewers are directly matched rather than matching each query
with a database of reference descriptors. Such an embodiment
would enable the creation of ad hoc social peer communities
on subject matter for which a database of reference descriptors
is not available. Such an embodiment could match in
real-time viewers who are in the same public forum (e.g.,
stadium, bar, etc.) using portable electronic devices (e.g.,
mobile phones, PDAs, etc.).


Popularity Ratings 


In some implementations, real-time and aggregate statistics
are inferred from a list of viewers currently watching the
broadcast (e.g., show, advertisement, etc.). These statistics
can be gathered in the background while viewers are using
other applications. Statistics can include but are not limited
to: 1) the average number of viewers watching the broadcast;
2) the average number of times viewers watched the broadcast;
3) other shows the viewers watched; 4) the minimum and
peak number of viewers; 5) what viewers most often switched
to when they left a broadcast; 6) how long viewers watch a
broadcast; 7) how many times viewers flip a channel; 8)
which advertisements were watched by viewers; and 9) what
viewers most often switched from when they entered a broadcast,
etc. From these statistics, one or more popularity ratings
can be determined.

The statistics used to generate popularity ratings can be
generated using a counter for each broadcast channel being
monitored. In some implementations, the counters can be
intersected with demographic group data or geographic group
data. The popularity ratings can be used by viewers to “see
what’s hot” while the broadcast is ongoing (e.g., by noticing
an increased rating during the 2004 Super Bowl half-time
performance). Advertisers and content providers can also use
popularity ratings to dynamically adjust the material shown in
response to ratings. This is especially true for advertisements,
since the short unit length and numerous versions of advertisements
generated by advertising campaigns are easily
exchanged to adjust to viewer rating levels. Other examples of
statistics include but are not limited to: popularity of a television
broadcast versus a radio broadcast by demographics or
time, the popularity of times of day, i.e., peak watching/
listening times, the number of households in a given area, the
amount of channel surfing during particular shows (genre of
shows, particular times of day), the volume of the broadcast,
etc.
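
As one hypothetical sketch of the per-channel counters described above, the following keeps a current and a peak viewer count for each channel and tallies what viewers switched to when they left a channel. The class name `ChannelStats`, its attributes and the event format are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

class ChannelStats:
    """Per-channel counters updated from (viewer, channel) match events."""

    def __init__(self):
        self.current = Counter()                 # viewers on each channel now
        self.peak = Counter()                    # highest count seen per channel
        self.switched_to = defaultdict(Counter)  # old channel -> new channel -> count
        self.watching = {}                       # viewer -> channel

    def observe(self, viewer, channel):
        previous = self.watching.get(viewer)
        if previous == channel:
            return  # viewer is still on the same channel
        if previous is not None:
            self.current[previous] -= 1
            self.switched_to[previous][channel] += 1
        self.watching[viewer] = channel
        self.current[channel] += 1
        self.peak[channel] = max(self.peak[channel], self.current[channel])

stats = ChannelStats()
stats.observe("u1", "A")
stats.observe("u2", "A")
stats.observe("u1", "B")  # u1 flips from channel A to channel B
```

Intersecting with demographic or geographic data would amount to keeping one such set of counters per group.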

The personalized information is sent to the client-side
interface (312). The popularity ratings can also be stored in a
database for use by other processes (318), such as the
dynamic adjustment of advertisements described above. The
personalized information is received at the client-side interface
(314) where it is formatted and presented in a user
interface (316). The personalized information can be associated
with a commenting medium (e.g., text messages in a chat
room) that is presented to the user in a user interface. In some
implementations, a chat room can include one or more subgroups.
For example, a discussion group for “Seinfeld” might
include a subgroup called “Seinfeld Experts,” or a subgroup
may be associated with a particular demographic, such as
women between the ages of 20-30 who watch “Seinfeld,” etc.

In some implementations, the raw information (e.g.,
counter values) used to generate statistics for popularity ratings
is collected and stored at the client-side interface rather
than at the social application server. The raw information can
be transferred to the broadcaster whenever the user is online
and/or invokes a mass personalization application.

In some implementations, a broadcast measurement box
(BMB) is installed at the client-side interface. The BMB can
be a simple hardware device that is similar to a set-top box but
does not connect to the broadcast device. Unlike the Nielsen
rating system, which requires hardware to be installed in the
television, the BMB can be installed near the mass media
system or within the range of the television signal. In some
implementations, the BMB automatically records audio snippets
and generates descriptors, which are stored in memory
(e.g., flash media). In some implementations, the BMB can
optionally include one or more hardware buttons which can
be pressed by a user to indicate which broadcast they are
watching (similar to Nielsen® ratings). The BMB device can
be picked up by the ratings provider from time to time to
collect the stored descriptors, or the BMB can broadcast the
stored descriptors to one or more interested parties over a
network connection (e.g., telephone, Internet, wireless radio,
such as SMS/carrier radio, etc.) from time to time.

In some implementations, advertisements can be monitored
to determine the ad’s effectiveness, which can be
reported back to advertisers. For example, the monitoring can
determine which ads were watched or skipped, the volume level
of the ads, etc.

In some implementations, an image capture device (e.g.,
digital camera, video recorder, etc.) can be used to measure
how many viewers are watching or listening to a broadcast.
For example, various known pattern-matching algorithms
can be applied to an image or a sequence of images to determine
the number of viewers present in a broadcast environment
during a particular broadcast. The images and/or data
derived from the images can be used in combination with
audio descriptors to gather personalized information for a
user, compute popularity ratings, or for any other purpose.


Audio Fingerprinting Process 


FIG. 4 is a flow diagram of audio fingerprinting process
400. The steps of process 400 do not have to be completed in
any particular order and at least some steps can be performed
at the same time in a multi-threading or parallel processing
environment. The process 400 matches query descriptors
generated at a client-side interface (e.g., client-side interface
102) to reference descriptors stored in one or more databases
in real-time and with low latency. The process 400 adapts a
technique proposed by Ke et al. to handle ambient audio data
(e.g., from a television broadcast) and queries.

The process 400 begins at a client-side interface by decomposing
ambient audio snippets (e.g., 5-6 seconds of audio) of
a mass media broadcast captured by an ambient audio detector
(e.g., microphone) into overlapping frames (402). In some
implementations, the frames are spaced apart by several milliseconds
(e.g., 12 ms apart). Each frame is converted into a
descriptor (e.g., a 32-bit descriptor) that is trained to overcome
audio noise and distortion (404), as described in Ke et
al. In some implementations, each descriptor represents an
identifying statistical summary of the audio snippet.

In some implementations, the descriptors can be sent as
query snippets (also referred to as query descriptors) to an
audio database server where they are matched to a database of
reference descriptors identifying statistical summaries of previously
recorded audio snippets of the mass media broadcast
(406). A list of candidate descriptors having best matches can
be determined (408). The candidate descriptors can be
scored, such that candidate descriptors that are temporally
consistent with the query descriptor are scored higher than
candidate descriptors that are less temporally consistent with
the query descriptor (410). The candidate descriptors with the
highest scores (e.g., score exceeds a sufficiently high threshold
value) are transmitted or otherwise provided to a social
application server (412) where they can be used to aggregate
personalized information related to the media broadcast.
Using a threshold ensures that descriptors are sufficiently
matched before the descriptors are transmitted or otherwise
provided to the social application server (412).

In some implementations, the database of reference
descriptors can be generated from broadcasts given by various
media companies, which can be indexed and used to
generate the descriptors. In other implementations, reference
descriptors can also be generated using television guides or
other metadata and/or information embedded in the broadcast
signal.

In some implementations, speech recognition technology
can be used to help identify which program is being watched.
Such technology could help users discuss news events instead
of just television shows. For example, a user could be watching
a Shuttle launch on a different channel than another
viewer and, therefore, possibly getting a different audio signal
(e.g., due to a different newscaster). Speech recognition
technology could be used to recognize keywords (e.g.,
Shuttle, launch, etc.), which can be used to link the user with
a commenting medium.


Hashing Descriptors 


Ke et al. uses computer vision techniques to find highly
discriminative, compact statistics for audio. Their procedure
is trained on labeled pairs of positive examples (where x and x'
are noisy versions of the same audio) and negative examples
(where x and x' are from different audio). During this training
phase, a machine-learning technique based on boosting uses
the labeled pairs to select a combination of 32 filters and
thresholds that jointly create a highly discriminative statistic.
The filters localize changes in the spectrogram magnitude,
using first and second order differences across time and frequency.
One benefit of using these simple difference filters is
that they can be calculated efficiently using an integral image
technique described in Viola, P. and Jones, M. (2002), Robust
Real-Time Object Detection, International Journal of Computer
Vision, which is incorporated by reference herein in its
entirety.
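
As a minimal sketch of the integral image technique referred to above, the following builds a summed-area table over a small spectrogram and uses it to evaluate a first-order difference across time with a fixed number of lookups. It assumes rows are frequency bands and columns are time frames; the function names and the toy spectrogram are hypothetical, and the learned filters of Ke et al. are not reproduced here.

```python
def integral_image(spec):
    """Summed-area table with one extra zero row and column."""
    rows, cols = len(spec), len(spec[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = spec[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of spec over rows r0..r1-1 and columns c0..c1-1 (four lookups)."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]

def time_difference(ii, r0, r1, c0, c_mid, c1):
    """First-order difference across time: later box minus earlier box."""
    return box_sum(ii, r0, c_mid, r1, c1) - box_sum(ii, r0, c0, r1, c_mid)

spec = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(spec)
total = box_sum(ii, 0, 0, 3, 4)           # 78, the sum of every cell
inner = box_sum(ii, 1, 1, 3, 3)           # 6 + 7 + 10 + 11 = 34
diff = time_difference(ii, 0, 3, 0, 2, 4) # 45 - 33 = 12
```

Once the table is built, every box sum costs four lookups regardless of box size, which is what makes evaluating many difference filters per frame inexpensive.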




In some implementations, the outputs of these 32 filters are
thresholded, giving a single bit per filter at each audio frame.
These 32 threshold results form the only transmitted descriptor
of that frame of audio. This sparsity in encoding protects the
privacy of the user against unauthorized eavesdropping. Further,
these 32-bit descriptors are robust to the audio distortions in
the training data, so that positive examples (e.g., matching
frames) have small Hamming distances (i.e., distance measuring
differing number of bits) and negative examples (e.g.,
mismatched frames) have large Hamming distances. It should
be noted that more or fewer filters can be used and more than
one bit per filter can be used at each audio frame (e.g., more
bits using multiple threshold tests).
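
For illustration only, thresholding 32 filter outputs into a single descriptor and comparing two descriptors by Hamming distance can be sketched as follows. The filter outputs and the all-zero thresholds are made-up values; in practice both the filters and the thresholds would come from the training procedure described above.

```python
def encode_frame(filter_outputs, thresholds):
    """Pack one bit per filter: bit i is set when output i exceeds threshold i."""
    descriptor = 0
    for i, (value, threshold) in enumerate(zip(filter_outputs, thresholds)):
        if value > threshold:
            descriptor |= 1 << i
    return descriptor

def hamming_distance(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")

thresholds = [0.0] * 32  # hypothetical thresholds
clean = [1.0 if i % 2 == 0 else -1.0 for i in range(32)]
noisy = list(clean)
noisy[3] = 0.4  # distortion pushes one filter output across its threshold

d_clean = encode_frame(clean, thresholds)  # 0x55555555
d_noisy = encode_frame(noisy, thresholds)
distance = hamming_distance(d_clean, d_noisy)  # 1
```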

In some implementations, the 32-bit descriptor itself is used
as a hash key for direct hashing. The descriptor is a well-balanced
hash function. Retrieval rates are further improved
by querying not only the query descriptor, but also a small set
of similar descriptors (up to a Hamming distance of 2 from the
original query descriptor).
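
A minimal sketch of this probing strategy is given below. It assumes the index is an ordinary dictionary mapping a 32-bit descriptor to a list of database frame numbers; the names `neighbors_within` and `lookup` and the sample index are hypothetical. For 32 bits and a maximum distance of 2, each query expands to 1 + 32 + 496 = 529 keys.

```python
from itertools import combinations

def neighbors_within(descriptor, max_distance=2, bits=32):
    """Yield the descriptor and every value within max_distance bit flips."""
    yield descriptor
    for d in range(1, max_distance + 1):
        for positions in combinations(range(bits), d):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield descriptor ^ mask

def lookup(index, descriptor):
    """Collect database hits for the query descriptor and its near neighbors."""
    hits = []
    for key in neighbors_within(descriptor):
        hits.extend(index.get(key, []))
    return hits

index = {0x12345678: [1008]}             # descriptor -> database frame numbers
near_query = 0x12345678 ^ 0b101          # two bits corrupted by noise
far_query = 0x12345678 ^ 0b111           # three bits corrupted
probe_count = sum(1 for _ in neighbors_within(0))  # 529
near_hits = lookup(index, near_query)    # [1008]
far_hits = lookup(index, far_query)      # []
```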


Within-Query Temporal Consistency


Once the query descriptors are matched to the audio database
using the hashing procedure described above, the
matches are validated to determine which of the database
return hits are accurate matches. Otherwise, a candidate
descriptor might have many frames matched to the query
descriptor but with the wrong temporal structure.

In some implementations, validation is achieved by viewing
each database hit as support for a match at a specific
query-database offset. For example, if the eighth descriptor
(q_8) in a 5-second, 415-frame-long “Seinfeld” query snippet,
q, hits the 1008th database descriptor (x_1008), this supports a
candidate match between the 5-second query and frames
1001 through 1415 in the audio database. Other matches
between q_n and x_(1000+n) (1 ≤ n ≤ 415) would support this same
candidate match.
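
The offset-voting idea can be sketched as follows, assuming each database hit is reported as a pair of 1-based frame numbers (query frame, database frame). The function name `vote_offsets` and the sample hits are hypothetical.

```python
from collections import Counter

def vote_offsets(hits):
    """Count how many hits support each query-database offset."""
    return Counter(db_frame - query_frame for query_frame, db_frame in hits)

# (query frame, database frame) pairs; one hit is a spurious hash collision.
hits = [(8, 1008), (9, 1009), (20, 1020), (8, 5321), (300, 1300)]
votes = vote_offsets(hits)
best_offset, support = votes.most_common(1)[0]  # offset 1000 with 4 votes

# A 415-frame query at this offset covers database frames 1001 through 1415.
first_frame, last_frame = best_offset + 1, best_offset + 415
```

Hits that share an offset reinforce one another, while hits with the wrong temporal structure scatter across many offsets and gather little support.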

In addition to temporal consistency, we need to account for
frames when conversations temporarily drown out the ambient
audio. This can be modeled as an exclusive switch
between ambient audio and interfering sounds. For each
query frame i, there is a hidden variable, y_i: if y_i=0, the i-th
frame of the query is modeled as interference only; if y_i=1, the
i-th frame is modeled as from clean ambient audio. Taking an
extreme view (pure ambient or pure interference) is justified
by the extremely low precision with which each audio frame
is represented (32 bits) and softened by providing additional
bit-flip probabilities for each of the 32 positions of the frame
vector under each of the two hypotheses (y_i=0 and y_i=1).
Finally, we model the between-frame transitions between
ambient-only and interference-only states as a hidden first-order
Markov process, with transition probabilities derived
from training data. For example, we can re-use the 66-parameter
probability model given by Ke et al., CVPR 2005.

The final model of the match probability between a query
vector, q, and an ambient-database vector at an offset of N
frames, x^N, is:

$$P(q \mid x^{N}) = \prod_{n=1}^{415} P(\langle q_n, x_{N+n} \rangle \mid y_n)\, P(y_n \mid y_{n-1}) \qquad (1)$$

where <q_n, x_m> denotes the bit differences between the 32-bit
frame vectors q_n and x_m. This model incorporates both the
temporal consistency constraint and the ambient/interference
hidden Markov model.
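
As a hedged sketch, the match probability can be evaluated in the log domain with a standard forward recursion that sums over the hidden ambient/interference states. Summing over the states, using one shared bit-flip probability per state instead of per-position probabilities, and all numeric parameter values below are simplifying assumptions made for this example; they are not the trained 66-parameter model. The sequences are assumed to be non-empty.

```python
import math

BITS = 32

def log_emission(diff_bits, flip_prob):
    """log P(<q_n, x_m> | y), with one shared flip probability per state."""
    return diff_bits * math.log(flip_prob) + (BITS - diff_bits) * math.log(1.0 - flip_prob)

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_match_probability(query, reference, flip, trans, init):
    """Forward recursion over y_n in {0: interference, 1: ambient}."""
    alpha = None
    for q, x in zip(query, reference):
        d = bin(q ^ x).count("1")  # bit differences between the two frames
        emit = [log_emission(d, flip[y]) for y in (0, 1)]
        if alpha is None:
            alpha = [math.log(init[y]) + emit[y] for y in (0, 1)]
        else:
            alpha = [
                emit[y] + logsumexp([alpha[p] + math.log(trans[p][y]) for p in (0, 1)])
                for y in (0, 1)
            ]
    return logsumexp(alpha)

flip = [0.5, 0.1]                     # hypothetical flip probability per state
trans = [[0.9, 0.1], [0.05, 0.95]]    # hypothetical P(y_n | y_(n-1))
init = [0.2, 0.8]                     # hypothetical initial state distribution

query = [0x12345678, 0x0F0F0F0F, 0xDEADBEEF]
same = list(query)
other = [0xEDCBA987, 0xF0F0F0F0, 0x21524110]  # bitwise complements of the query

score_same = log_match_probability(query, same, flip, trans, init)
score_other = log_match_probability(query, other, flip, trans, init)
```

A reference segment that agrees with the query frame by frame receives a much higher log-probability than an unrelated one, while a short burst of interference is absorbed by the y=0 state rather than ruling out the match.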




Post-Match Consistency Filtering


People often talk with others while watching television,
resulting in sporadic but strong acoustic interference, especially
when using laptop-based microphones for sampling the
ambient audio. Given that most conversational utterances are
two or three seconds in duration, a simple communication
exchange between viewers could render a 5-second query
unrecognizable.

In some implementations, post-match filtering is used to
handle these intermittent low-confidence mismatches. For
example, we can use a continuous-time hidden Markov
model of channel switching with an expected dwell time (i.e.,
time between channel changes) of L seconds. The social
application server 106 indicates the highest-confidence
match within the recent past (along with its “discounted”
confidence) as part of state information associated with each
client session. Using this information, the server 106 selects
either the content-index match from the recent past or the
current index match, based on whichever has the higher
confidence.

We use M_h and C_h to refer to the best match for the previous
time step (5 seconds ago) and its log-likelihood confidence
score. If we simply apply the Markov model to this previous
best match, without taking another observation, then our
expectation is that the best match for the current time is that
same program sequence, just 5 seconds further along, and our
confidence in this expectation is C_h - l/L, where l = 5 seconds is
the query time step. This discount of l/L in the log-likelihood
corresponds to the Markov model probability, e^(-l/L), of not
switching channels during the l-length time step.

An alternative hypothesis is generated by the audio match
for the current query. We use M_0 to refer to the best match for
the current audio snippet: that is, the match that is generated
by the audio fingerprinting process 400. C_0 is the log-likelihood
confidence score given by the audio fingerprinting process
400.

If these two matches (the updated historical expectation
and the current snippet observation) give different matches,
we select the hypothesis with the higher confidence score:

$$(M_0, C_0) = \begin{cases} (M_h,\; C_h - l/L) & \text{if } C_h - l/L > C_0 \\ (M_0,\; C_0) & \text{otherwise,} \end{cases} \qquad (2)$$

where M_0 is the match that is used by the social application
server 106 for selecting related content and M_0 and C_0 are
carried forward on the next time step as M_h and C_h.
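
The selection rule can be sketched as below. The function name `select_match`, the 30-second dwell time and the confidence values are hypothetical; advancing the historical match by one time step along the program sequence is omitted for brevity.

```python
def select_match(prev_match, prev_conf, cur_match, cur_conf, step=5.0, dwell=30.0):
    """Keep the discounted historical match if it beats the current observation.

    step is the query time step l and dwell is the expected dwell time L,
    both in seconds; confidences are log-likelihood scores.
    """
    discounted = prev_conf - step / dwell
    if discounted > cur_conf:
        return prev_match, discounted
    return cur_match, cur_conf

# A conversation garbles one snippet, so the current match is weak (-9.0).
match, conf = select_match("seinfeld", -2.0, "news", -9.0)

# The returned pair is carried forward; a strong new match then takes over.
next_match, next_conf = select_match(match, conf, "news", -1.0)
```

Because the discount accumulates with every step in which the historical match is retained, a stale match eventually yields to the current observation even when that observation is only moderately confident.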


User Interface 


FIG. 5 is a flow diagram of one embodiment of a user
interface 208 for interacting with mass personalization applications.
The user interface 208 includes a personalized layer
display area 502, a commenting medium display area 504, a
sponsored links display area 506 and a content display area
508. The personalized layer display area 502 provides
complementary information and/or images related to the
video content shown in the content display area 508. The
personalized layers can be navigated using a navigation bar
510 and an input device (e.g., a mouse or remote control).
Each layer has an associated label in the navigation bar 510.
For example, if the user selects the “Fashion” label, then the
fashion layer, which includes fashion related content associated
with “Seinfeld,” will be presented in the display area 502.




In some implementations, the client-side interface 102
includes a display device 210 capable of presenting the user
interface 208. In some implementations, the user interface
208 is an interactive web page served by the social application
server 106 and presented in a browser window on the screen
of the display device 210. In some implementations, the user
interface 208 is persistent and will be available for interaction
after the broadcast audio used in the content match process
has shifted in time. In some implementations, the user interface
208 is dynamically updated over time or in response to a
trigger event (e.g., a new person enters the chat room, a
commercial begins, etc.). For example, each time a commercial
is broadcast, the sponsored links display area 506 can be
updated with fresh links 518 related to the subject matter of
the commercial.

In some implementations, the personalized information
and sponsored links can be emailed to the viewer or shown on
a side bar at a later time.

In some implementations, the client-side interface 102
receives personalized information from the social application
server 106. This information can include a web page, email, a
message board, links, instant message, a chat room, or an
invitation to join an ongoing discussion group, eRoom, video
conference or netmeeting, voice call (e.g., Skype®), etc. In
some implementations, the user interface 208 provides access
to comments and/or links to comments from previously seen
broadcasts or movies. For example, if a user is currently watching
a DVD of “Shrek” he may want to see what people said
about the movie in the past.

In some implementations, the display area 502 includes a
rating region 512, which is used to display popularity ratings
related to a broadcast. For example, the rating region 512 may
show how many viewers are currently watching “Seinfeld”
compared to another television show that is broadcast at the
same time.

In some implementations, the commenting medium display
area 504 presents a chat room type environment where
multiple users can comment about broadcasts. In some implementations,
the display area 504 includes a text box 514 for
inputting comments that are sent to the chat room using the
input mechanism 516 (e.g., a button).

The sponsored links display area 506 includes information,
images and/or links related to advertising that is associated
with the broadcast. For example, one of the links 518 may
take the user to a web site that is selling “Seinfeld” merchandise.

The content display area 508 is where the broadcast content
is displayed. For example, a scene from the current
broadcast can be displayed with other relevant information
(e.g., episode number, title, timestamp, etc.). In some implementations,
the display area 508 includes controls 520 (e.g.,
scroll buttons) for navigating through the displayed content.


Video Bookmarks 


In some implementations, a button 522 is included in the
content display area that can be used to bookmark video. For
example, by clicking the button 522, the “Seinfeld” episode
shown in the display area 508 is added to the user’s favorites
video library, which can then be viewed on-demand through
a web-based streaming application or other access methods.
According to the policy set by the content owner, this streaming
service can provide free single-viewing playback, collect
payments as the agent for the content owners, or insert advertisements
that would provide payment to the content owners.


Client-Side Interface Hardware Architecture


FIG. 6 is a block diagram of hardware architecture 600 for
the client-side interface 102 shown in FIG. 1. Although the
hardware architecture 600 is typical of a computing device
(e.g., a personal computer), the disclosed implementations
can be realized in any device capable of presenting a user
interface on a display device, including but not limited to:
desktop or portable computers; electronic devices; telephones;
mobile phones; display systems; televisions; monitors;
navigation systems; portable media players/recorders;
personal digital assistants; game systems; handheld electronic
devices; and embedded electronic devices or appliances.


In some implementations, the system 600 includes one or
more processors 602 (e.g., CPU), optionally one or more
display devices 604 (e.g., CRT, LCD, etc.), a microphone
interface 606, one or more network interfaces 608 (e.g., USB,
Ethernet, FireWire® ports, etc.), optionally one or more input
devices 610 (e.g., mouse, keyboard, etc.) and one or more
computer-readable mediums 612. Each of these components
is operatively coupled to one or more buses 614 (e.g., EISA,
PCI, USB, FireWire®, NuBus, PDS, etc.).


In some implementations, there are no display devices or
input devices and the system 600 just performs sampling and
encoding (e.g., generating descriptors, etc.) in the background
without user input.


The term “computer-readable medium” refers to any
medium that participates in providing instructions to a processor
602 for execution, including without limitation, non-volatile
media (e.g., optical or magnetic disks), volatile media
(e.g., memory) and transmission media. Transmission media
includes, without limitation, coaxial cables, copper wire and
fiber optics. Transmission media can also take the form of
acoustic, light or radio frequency waves.


The computer-readable medium(s) 612 further includes an
operating system 616 (e.g., Mac OS, Windows®, Unix,
Linux, etc.), a network communications module 618, client
software 620 and one or more applications 622. The operating
system 616 can be multi-user, multiprocessing, multitasking,
multithreading, real-time and the like. The operating system
616 performs basic tasks, including but not limited to: recognizing
input from input devices 610; sending output to display
devices 604; keeping track of files and directories on storage
devices 612; controlling peripheral devices (e.g., disk drives,
printers, image capture device, etc.); and managing traffic on
the one or more buses 614.


The network communications module 618 includes various
components for establishing and maintaining network
connections (e.g., software for implementing communication
protocols, such as TCP/IP, HTTP, Ethernet, USB, FireWire®,
etc.).


The client software 620 provides various software components
for implementing the client-side of the mass personalization
applications and for performing the various client-side
functions described with respect to FIGS. 1-5 (e.g.,
ambient audio identification). In some implementations,
some or all of the processes performed by the client software
620 can be integrated into the operating system 616. In some
implementations, the processes can be at least partially implemented
in digital electronic circuitry, or in computer hardware,
firmware, software, or in any combination thereof.


Other applications 624 can include any other software
application, including but not limited to: word processors,
browsers, email, Instant Messaging, media players, tele-
phony software, etc.




Detecting Advertisements and Rebroadcasts 


Repetition Detection 

When preparing a database for search, it helps to be able to
pre-flag repeated material using the descriptors previously
described. Repeating material can include but is not limited to
repeating shows, advertisements, sub-segments (e.g., stock
footage in news shows), etc. Using these flags, repeated mate-
rial can be presented in a way that does not push all other
material beyond the attention span of a user conducting a
search (e.g., beyond the first 10-20 hits). The process 700
described below provides a way to detect those duplicates
prior to any search queries on the database.
Video Ad Removal 

One of the complaints that broadcasters have had about
allowing material to be searched and played back is the
rebroadcast of embedded advertising. From the point of view
of the broadcasters, this rebroadcast is counterproductive: it
lowers the value of the broadcasts that the advertiser pays for
directly, since it provides that advertiser with free advertising.
Unless old advertisements are removed and new advertise-
ments are put in place in a way that returns some revenue to the
original broadcasters, they do not profit from the replay of
their previously broadcast material. The process 700
described below provides a way of detecting embedded
advertisements by looking for repetitions, possibly in conjunc-
tion with other criteria (e.g., duration, volume, visual activity,
bracketing blank frames, etc.).
Video Summarization 

If a “summary” (i.e., shorter version) of non-repeated pro-
gram material is needed, one way to get that is to remove the
advertisements (as detected by repeated material) and to take
segments from the material just preceding and just following
the advertisement location. On broadcast television, these
positions in the program typically contain “teasers” (before
the ads) and “recaps” (just after the ads). If a summary is to be
made of a news program that includes a mix of non-repeated
and repeated non-advertisement material, typically the
repeated non-advertisement material corresponds to a sound
bite. These segments generally contribute less information
than the anchorperson’s narration of the news story and are
good candidates for removal. If a summary is to be made of a
narrative program (e.g., a movie or a serial installment),
repeated audio tracks typically correspond to theme sounds,
mood music, or silence. Again, these are typically good seg-
ments to remove from a summary video. The process 700
described below provides a way of detecting these repeated
audio tracks so they can be removed from the summary video.
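
The selection of material just before and just after each advertisement location can be sketched as follows. This is a minimal illustration only: the function name `summary_segments`, the `context` window length, and the representation of ad breaks as non-overlapping (start, end) pairs in seconds are assumptions, not details given in this description.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def summary_segments(ad_breaks: List[Segment], program_end: float,
                     context: float = 20.0) -> List[Segment]:
    """Pick the material just before and just after each ad break.

    ad_breaks are assumed to be non-overlapping (start, end) pairs.
    context is a hypothetical window length in seconds.
    """
    breaks = sorted(ad_breaks)
    picks: List[Segment] = []
    for i, (start, end) in enumerate(breaks):
        # Never reach back into the previous break or forward into the next.
        prev_end = breaks[i - 1][1] if i > 0 else 0.0
        next_start = breaks[i + 1][0] if i + 1 < len(breaks) else program_end
        picks.append((max(prev_end, start - context), start))  # "teaser"
        picks.append((end, min(next_start, end + context)))    # "recap"

    # Drop empty segments, then merge segments that overlap or touch.
    merged: List[Segment] = []
    for seg in sorted(p for p in picks if p[1] > p[0]):
        if merged and seg[0] <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], seg[1]))
        else:
            merged.append(seg)
    return merged
```

The returned segments exclude the detected advertisements themselves, so concatenating them yields a shorter version built from the teaser and recap positions.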


Repetition Detection Process 


FIG. 7 is a flow diagram of one embodiment of a repetition
detection process 700. The steps of process 700
do not have to be completed in any particular order and at least
some steps can be performed at the same time in a multi-
threading or parallel processing environment.

The process 700 begins by creating a database of audio
statistics from a set of content such as television feeds, video
uploads, etc. (702). For example, the database could contain
32-bit/frame descriptors, as described in Ke et al. Queries are
taken from the database and run against the database to see
where repetitions occur (704). In some implementations, a
short segment of audio statistics is taken as a query and
checked for non-identity matches (matches that are not iden-
tical) using hashing techniques (e.g., direct hashing or locality
sensitive hashing (LSH)) to achieve a short list of possible
auditory matches. These candidate matches are then pro-
cessed in a validation procedure, for example, as described in
Ke et al. Content corresponding to a validated candidate
match can be identified as repeating content (706).
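
As a rough illustration of steps 702 and 704, the following Python sketch uses direct hashing: each per-frame descriptor is used as a key into an index of frame positions, and a query segment votes for time offsets at which the same descriptors recur. The function names, the offset-voting scheme and the `min_votes` threshold are assumptions made for this example. The validation procedure of Ke et al. is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_index(descriptors: List[int]) -> Dict[int, List[int]]:
    """Map each per-frame descriptor value to the frame positions holding it."""
    index: Dict[int, List[int]] = defaultdict(list)
    for pos, desc in enumerate(descriptors):
        index[desc].append(pos)
    return index


def candidate_offsets(descriptors: List[int], index: Dict[int, List[int]],
                      query_start: int, query_len: int,
                      min_votes: int) -> List[int]:
    """Return time offsets at which the query segment appears to repeat.

    Offset 0 is skipped because it is the identity match of the query
    against itself. min_votes is a hypothetical candidate threshold.
    """
    votes: Counter = Counter()
    query_end = min(query_start + query_len, len(descriptors))
    for q in range(query_start, query_end):
        for pos in index.get(descriptors[q], ()):
            offset = pos - q
            if offset != 0:
                votes[offset] += 1
    return sorted(off for off, n in votes.items() if n >= min_votes)
```

Each returned offset is only a candidate. It would still be passed to a validation step before the corresponding content is flagged as repeating (706).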

The non-identity matches that are strongest are “grown”
forwards and backwards in time, to find the beginning and
ending points of the repeated material (708). In some imple-
mentations, this can be done using known dynamic program-
ming techniques (e.g., Viterbi decoding). In extending the
match forward in time, the last time slice in the strong “seed”
match is set as “matching” and the last time slice of the first
below-believable-strength match for the same database offset
between the query and the match is set as “not matching.” In
some implementations, match scores for individual frames in
between these two fixed points are used as observations and a
first-order Markov model allowing within-state transitions
plus a single transition from “matching” to “not matching”
states is used. The transition probability from matching to not
matching can be set somewhat arbitrarily to 1/L, where L is
the number of frames between these two fixed points, corre-
sponding to the least knowledge of the transition location
within the allowed range. Another possibility for selecting
transition probabilities would use the match strength profiles
to bias this estimate to an earlier or later transition. But this
would increase the complexity of the dynamic programming
model and is not likely to improve the results, since the match
strengths are already used as observations within this period.
The same process is used to grow the segment matches back-
wards in time (e.g., just switch past/future and run the same
algorithm).
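
The forward-growing step can be sketched in Python as Viterbi decoding over a two-state model with the 1/L transition probability described above. The emission model is an assumption made for illustration: each frame's match score is taken to lie in [0, 1] and is used directly as the likelihood of the “matching” state, with one minus the score used for “not matching.” The function name `find_transition` is likewise hypothetical.

```python
import math
from typing import List


def find_transition(scores: List[float], eps: float = 1e-6) -> int:
    """Return the index of the first "not matching" frame.

    scores[0] is the last frame of the seed match (pinned to "matching");
    scores[-1] is the first below-believable-strength frame (pinned to
    "not matching"). Scores are assumed to be scaled to [0, 1].
    """
    L = len(scores)
    if L < 2:
        raise ValueError("need at least the two fixed end points")
    log_switch = math.log(1.0 / L)        # matching -> not matching
    log_stay = math.log(1.0 - 1.0 / L)    # matching -> matching
    neg_inf = float("-inf")

    def emit(state: int, s: float) -> float:
        s = min(max(s, eps), 1.0 - eps)   # keep the logarithm finite
        return math.log(s if state == 0 else 1.0 - s)

    # State 0 = matching, state 1 = not matching.
    best = [[neg_inf, neg_inf] for _ in range(L)]
    back = [[0, 0] for _ in range(L)]
    best[0][0] = emit(0, scores[0])
    for t in range(1, L):
        best[t][0] = best[t - 1][0] + log_stay + emit(0, scores[t])
        stay = best[t - 1][1]             # not matching stays put (prob. 1)
        switch = best[t - 1][0] + log_switch
        if switch >= stay:
            best[t][1], back[t][1] = switch + emit(1, scores[t]), 0
        else:
            best[t][1], back[t][1] = stay + emit(1, scores[t]), 1

    # Backtrack from the pinned final state to the switch point.
    t = L - 1
    while t > 0 and back[t][1] == 1:
        t -= 1
    return t
```

Growing a match backwards in time could reuse the same function by passing the scores in reversed order.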

In some implementations, the audio cues are combined with
non-auditory information (e.g., visual cues) to obtain higher
matching accuracies. For example, the matches that are found
with audio matching can then be verified (or checked a second
time) by using simple visual similarity metrics (710). These
metrics can include but are not limited to: color histograms
(e.g., frequencies of similar colors in two images), statistics
on number and distribution of edges, etc. These need not be
computed only over the entire image, but can be computed for
sub-regions of the images as well, and compared to the cor-
responding sub-regions in the target image.
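
One way such a check might look is sketched below in Python: coarse color histograms are computed for corresponding sub-regions of two frames and compared by histogram intersection. The grid size, the number of bins, the choice of histogram intersection as the similarity measure and all function names are assumptions for illustration. Images are represented as nested lists of (r, g, b) tuples to keep the example self-contained.

```python
from collections import Counter
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (r, g, b) pixels, values 0..255


def color_histogram(pixels: List[Pixel], bins: int = 4) -> Counter:
    """Count pixels per quantized color; bins is assumed to divide 256."""
    step = 256 // bins
    return Counter((r // step, g // step, b // step) for r, g, b in pixels)


def histogram_intersection(a: Counter, b: Counter) -> float:
    """Similarity in [0, 1] between two normalized histograms."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    return float(sum(min(a[k] / total_a, b[k] / total_b)
                     for k in a.keys() & b.keys()))


def region_similarity(img1: Image, img2: Image,
                      grid: int = 2, bins: int = 4) -> float:
    """Mean histogram intersection over a grid x grid split of both images."""
    def region(img: Image, gy: int, gx: int) -> List[Pixel]:
        h, w = len(img), len(img[0])
        rows = img[gy * h // grid:(gy + 1) * h // grid]
        return [p for row in rows
                for p in row[gx * w // grid:(gx + 1) * w // grid]]

    sims = [histogram_intersection(color_histogram(region(img1, gy, gx), bins),
                                   color_histogram(region(img2, gy, gx), bins))
            for gy in range(grid) for gx in range(grid)]
    return sum(sims) / len(sims)
```

Comparing sub-regions rather than whole frames makes the check sensitive to where colors appear, not only to how often they appear.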

For those applications that are looking for advertisements
(in contrast with all types of repeated material), the results of
repeated-material detection can be combined with metrics
aimed at distinguishing advertisements from non-advertise-
ments (712). These distinguishing characteristics can rely on
advertising conventions, such as durations (e.g., 10/15/30-
second spots are common), on volume (e.g., advertisements
tend to be louder than surrounding program material, so if the
repeated material is louder than the material on either side, it
is more likely to be an advertisement), on visual activity (e.g.,
advertisements tend to have more rapid transitions between
shots and more within-shot motion, so if the repeated material
has larger frame differences than the material on either side,
it is more likely to be an advertisement), and on bracketing
blank frames (locally inserted advertisements typically do not
completely fill the slot that is left for them by the national feed,
resulting in black frames and silence at a spacing that is a
multiple of 30 seconds).
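
A simple way to combine these cues is to count how many of them a repeated segment satisfies, as in the Python sketch below. The equal weighting, the one-second duration tolerance, and the `RepeatedSegment` fields are assumptions for illustration. The description does not say how the cues are weighted or thresholded.

```python
from dataclasses import dataclass


@dataclass
class RepeatedSegment:
    """Hypothetical measurements for one detected repeated segment."""
    duration_s: float
    loudness_ratio: float     # segment loudness / surrounding loudness
    frame_diff_ratio: float   # segment frame difference / surrounding
    bracketed_by_blank: bool  # blank frames and silence on both sides


def ad_score(seg: RepeatedSegment, tolerance_s: float = 1.0) -> int:
    """Count how many advertising conventions the segment satisfies (0..4)."""
    score = 0
    # Duration close to a common 10/15/30-second spot.
    if any(abs(seg.duration_s - d) <= tolerance_s for d in (10, 15, 30)):
        score += 1
    # Louder than the material on either side.
    if seg.loudness_ratio > 1.0:
        score += 1
    # More visual activity than the material on either side.
    if seg.frame_diff_ratio > 1.0:
        score += 1
    # Bracketed by blank frames.
    if seg.bracketed_by_blank:
        score += 1
    return score
```

A segment with a high count is more likely to be an advertisement than other repeated material, such as a rerun, which would typically satisfy few of these conventions.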

Once advertisements are identified, material surrounding
the advertisements can be analyzed and statistics can be gen-
erated. For example, statistics can be generated about how
many times a particular product is advertised using a particu-
lar creative (e.g., images, text), or how many times a particu-
lar segment is aired, etc. In some implementations, one or
more old advertisements can be removed or replaced with
new advertisements. Additional techniques for advertisement
detection and replacement are described in Covell, M.,
Baluja, S., Fink, M., Advertisement Detection and Replace-
ment Using Acoustic and Visual Repetition, IEEE Signal
Processing Society, MMSP 2006 International Workshop on
Multimedia Signal Processing, Oct. 3-6, 2006, BC Canada,
which article is incorporated by reference herein in its
entirety.

In some implementations, information from content own-
ers about the detailed structure of the content (e.g., where ad
material was inserted, where programs were repeated) could
be used to augment the process 700 and increase matching
accuracies. In some implementations, video statistics can be
used to determine repetition instead of audio. In other imple-
mentations, a combination of video and audio statistics can be
used.


Audio Snippet Auctions 


In some implementations, advertisers can participate in
auctions related to the presence of ambient audio that is
related to the product or service that the advertiser wants to
sell. For example, multiple advertisers could bid in an auction
for the right to associate their products or services with an audio
snippet or descriptor associated with “Seinfeld.” The winner
of the auction could then put some related information in front
of the viewer (e.g., the sponsored links) whenever the subject
ambient audio is present. In some implementations, advertis-
ers could bid on ambient audio snippets having a meta-level
description. For example, advertisers could bid on audio that
is associated with a television ad (e.g., this is the audio asso-
ciated with a Ford Explorer TV ad), on closed captioning
(e.g., the captioning says “Yankees baseball”), on program
segment location (e.g., this audio will occur 15 min into the
“Seinfeld” and will occur 3 minutes after the previous com-
mercial break and 1 min before the next commercial break),
or on low-level acoustic or visual properties (e.g., “back-
ground music,” “conversational voices,” “explosive-like”,
etc.).

In some implementations, one or more mass personaliza-
tion applications can be run in the background while the user
performs other tasks such as browsing another web site (e.g.,
a sponsored link). Material that is related to a media broadcast
(e.g., television content) can participate in the same spon-
sored link auctions as material that is related to another con-
tent source (e.g., web site content). For example, TV related
ads can be mixed with ads that correspond to the content of a
current web page.

Various modifications may be made to the disclosed imple-
mentations and still be within the scope of the following
claims.


What is claimed is:

1. A computer-implemented method comprising:

receiving, from a first client system, a user selection of a
personalized information layer related to media content;

receiving a first descriptor from the first client system, the
first descriptor identifying the media content, where the
first descriptor is generated from the media content by
the first client system;

receiving a second descriptor from a second client system,
the second descriptor also identifying the media content;

identifying a match between the first descriptor and the
second descriptor;

automatically creating a virtual social community related
to the media content based at least in part on the identi-
fied match, the virtual social community being a venue
for communication between users accessing the media
content, wherein automatically creating the virtual
social community includes automatically selecting par-
ticipants of the social community from the users access-
ing the media content;

updating chat content to be shared in the virtual social
community in an ongoing manner; and

providing the virtual social community to the first client
system for display in the personalized information layer,

where the method is performed by one or more processors
during a broadcast of the media content.

2. The method of claim 1, where creating the virtual social
community further comprises:

determining one or more reference descriptors that corre-
spond to the identified match between the first descriptor
and second descriptor based at least in part on matching
criteria.

3. The method of claim 1, where the broadcast of the media
content includes at least one of a television broadcast, a radio 
broadcast, a movie broadcast, or an Internet broadcast. 

4. The method of claim 1, where the personalized informa-
tion layer provides complementary information to the media
content.

5. The method of claim 1, where the virtual social commu-
nity includes a virtual chat room.

6. The method of claim 5, where participants of the virtual
chat room are selected based at least in part on descriptors 
identifying the media content. 

7. The method of claim 6, where updating the chat content
includes updating real time chat content from the selected 
participants. 

8. A computer-implemented method comprising: 

on a client system, displaying one or more personalized
information layers on a display, where the personalized
information layers present personalized information and
relate to media content;

receiving a user selection of a personalized information
layer from the one or more personalized information
layers;

from the client system, transmitting a descriptor identify-
ing the media content to a network resource, the descrip-
tor generated from the media content by the client sys-
tem;

receiving personalized information from the network
resource, the personalized information including a vir-
tual social community in which real time chat content is
shared, the real time chat content being updated in an
ongoing manner, the virtual social community being
automatically created based at least in part on a match
between the descriptor and another descriptor from
another client system identifying the media content and
including participants automatically selected from users
accessing the media content, the other descriptor also
identifying the media content; and

displaying the personalized information in the selected
personalized information layer on the display of the
client system,

where the method is performed by one or more processors
during a broadcast of the media content.

9. The method of claim 8, where transmitting the descriptor
comprises:

recording snippets of the media content;

decomposing the snippets of the media content into over-
lapping frames; and

converting the frames into the descriptor, the descriptor
identifying statistical summaries of the media content.

10. The method of claim 8, where the virtual social com-
munity includes a virtual chat room.




11. The method of claim 8, where the virtual social com-
munity includes one or more subgroups, where at least one of
the one or more subgroups is associated with demographic
information.

12. The method of claim 8, further comprising transmitting
a client system identifier from the client system for identify-
ing the client system to the network resource.

13. A system comprising: 

a computer-readable storage device including a computer
program product; and

one or more processors configured to interact with the
storage device and execute the program product to per-
form, during a media broadcast, operations comprising:

receiving, from a client system, a request to create a virtual
chat room related to the media broadcast being dis-
played at the client system;

receiving a descriptor from the client system, the descriptor
identifying the media broadcast being displayed at the
client system, where the descriptor is generated from the
media broadcast by the client system;

automatically creating the virtual chat room, including
automatically selecting participants of the virtual chat
room from users accessing the media broadcast based at
least in part on a match between the descriptor and at
least one other descriptor from another client system that
also identifies the media broadcast;

providing the virtual chat room to the client system for
display in association with the media broadcast being
displayed;

updating chat content to be shared in the virtual chat room
in an ongoing manner from the participants selected
based at least in part on the matching descriptors iden-
tifying the media broadcast; and

sending the updated chat content to the client system in an
ongoing manner for display in the virtual chat room.

14. The system of claim 13, where the descriptor is gener-
ated from audio samples of the media broadcast.

15. The system of claim 13, where the virtual chat room
includes one or more subgroups. 

16. The system of claim 13, where the descriptor is gener-
ated from audio samples of the media broadcast.

17. The system of claim 13, where the received descriptor
is encrypted. 

18. The system of claim 13, where the received descriptor
is sent to a database server as a query submission in response 
to a trigger event at the client system. 

19. The system of claim 13, where the received descriptor
is sent to a database server as part of a streaming process. 

20. The system of claim 13, where the participants are in a
same geographic location. 

21. The system of claim 13, the operations further includ- 
ing receiving an identifier of the client system. 

22. The system of claim 13, where the descriptor represents
an identifying statistical summary of a sample of the media
broadcast.

23. The system of claim 13, where creating the virtual chat
room includes determining a match using one or more refer-
ence descriptors that are temporally consistent with the
descriptor generated by the client system.

24. A system comprising: 

a computer-readable storage device including a computer
program product; and

one or more processors configured to interact with the
storage device and execute the program product to per-
form, during a broadcast of media content, operations
comprising:

displaying the media content and a commenting medium
display area on a display device of the system;

receiving an input to create a virtual chat room related to
the media content being displayed;

transmitting a descriptor identifying the media content
to a network resource, the descriptor generated from
the media content by the system;

receiving, from the network resource, information
related to the virtual chat room, the virtual chat room
having participants automatically selected by the one
or more processors from users accessing the media
content based at least in part on a match between the
descriptor and another descriptor from another sys-
tem, the other descriptor also identifying the media
content;

providing the virtual chat room for display in the com-
menting medium display area;

receiving chat content from the network resource, the
chat content being updated in an ongoing manner
from the participants selected based at least in part on
the matching descriptors identifying the media con-
tent; and

displaying the updated chat content in the virtual chat
room in association with the media content in an
ongoing manner.

25. The system of claim 24, where the display device is
configured to allow a user to interact with the virtual chat
room and the media content.

26. The system of claim 24, where the broadcast includes at
least one of a television broadcast, a radio broadcast, a movie
broadcast, or an Internet broadcast.

27. The system of claim 24, the display device displaying a
user interface, the user interface including:

a personalized information layer display area for display-
ing the virtual chat room; and

a content display area for displaying the media content.

28. The system of claim 24, where the broadcast includes at
least one of a television broadcast, a radio broadcast, a movie 
broadcast, or an Internet broadcast. 

29. The system of claim 24, the operations further com-
prising transmitting a client system identifier from the system
for identifying the system to the network resource.

30. The system of claim 24, where transmitting the descrip-
tor comprises:

recording snippets of the media content;

decomposing the snippets of the media content into over-
lapping frames; and

converting the frames into the descriptor, the descriptor
identifying statistical summaries of the media content.

31. The system of claim 24, where the descriptor is
encrypted. 

32. The system of claim 24, where the descriptor is sent to
a database server as a query submission in response to a user 
selection. 

33. The system of claim 32, where transmitting the descrip-
tor identifying the media content to the network resource
includes sending the descriptor to the database server as part
of a streaming process.

34. A computer-readable storage device having instruc-
tions stored thereon, which, when executed by a processor,
causes the processor to perform operations comprising:

receiving, from a client system, a request to create a virtual
chat room related to a media broadcast being displayed
on a display of the client system;

receiving a descriptor from the client system, the descriptor
identifying the media broadcast, where the descriptor is
generated from the media broadcast by the client sys-
tem;

automatically creating the virtual chat room, including
automatically selecting participants of the virtual chat
room from users accessing the media broadcast based at
least in part on a match between the descriptor and at
least one other descriptor from another client system that
also identifies the media broadcast;

providing the virtual chat room to the client system for
display in association with the media broadcast being
displayed;

updating, in an ongoing manner, chat content to be shared
in the virtual chat room from the participants selected
based at least in part on the matching descriptors iden-
tifying the media broadcast; and

sending the updated chat content to the client system for
display in the virtual chat room in an ongoing manner,
where the operations are performed during the media
broadcast.

35. A computer-readable storage device having instruc-
tions stored thereon, which, when executed by a processor,
causes the processor to perform operations comprising:

on a client system, displaying media content and a com-
menting medium display area on a display;

receiving, on the client system, an input to create a virtual
chat room related to the media content being displayed
at the client system;

transmitting a descriptor identifying the media content to a
network resource, the descriptor generated from the
media content by the client system;

receiving, from the network resource, information related
to the virtual chat room, the virtual chat room having
participants automatically selected by the processor
from users accessing the media content based at least in
part on a match between the descriptor and another
descriptor from another client system, the other descrip-
tor also identifying the media content;

providing the virtual chat room for display in the comment-
ing medium display area;

receiving chat content from the network resource, the chat
content being updated in an ongoing manner from the
participants selected based at least in part on the match-
ing descriptors identifying the media content; and

displaying the updated chat content in the virtual chat room
in association with the media content in an ongoing
manner,

where the operations are performed during a broadcast of
the media content.

36. The storage device of claim 35, the operations further 
comprising transmitting a client system identifier from the 
client system for identifying the client system to the network
resource. 

37. The storage device of claim 35, where the broadcast of
the media content includes at least one of a television broad- 
cast, a radio broadcast, a movie broadcast, or an Internet 
broadcast. 

38. A method executed by one or more computers, the 
method comprising: 

receiving, from a client system, a request to create a virtual
chat room related to a media broadcast being displayed
at the client system;

receiving a descriptor from the client system, the descriptor
identifying the media broadcast being displayed at the
client system, where the descriptor is generated from
content of the media broadcast by the client system;

automatically creating the virtual chat room, including
automatically selecting participants of the virtual chat
room from users accessing the media content based at
least in part on a match between the descriptor and at
least one other descriptor from another client system that
also identifies the media content;

providing the virtual chat room to the client system for
display in association with the media broadcast being
displayed;

updating chat content to be shared in the virtual chat room
in an ongoing manner from the participants selected
based at least in part on the matching descriptors iden-
tifying the media broadcast; and

sending the updated chat content to the client system in an
ongoing manner for display in the virtual chat room,

where the method is performed during the media broad-
cast.

39. A method executed by one or more computers, the
method comprising:

on a client system, displaying media content and a com-
menting medium display area on a display;

receiving, on the client system, an input to create a virtual
chat room related to the media content being displayed
at the client system;

transmitting a descriptor identifying the media content to a
network resource, the descriptor generated from the
media content by the client system;

receiving, from the network resource, information related
to the virtual chat room, the virtual chat room having
participants automatically selected from users accessing
the media content based at least in part on a match
between the descriptor and another descriptor from
another client system, the other descriptor also identify-
ing the media content;

providing the virtual chat room for display in the comment-
ing medium display area;

receiving chat content from the network resource, the chat
content being updated in an ongoing manner from the
participants selected based at least in part on the match-
ing descriptors identifying the media content; and

displaying the updated chat content in the virtual chat room
in association with the media content in an ongoing
manner,

where the method is performed during a broadcast of the
media content.


